| Patrice Riemens on Tue, 24 Mar 2009 06:37:38 -0400 (EDT) | 
| <nettime> Ippolita Collective: The Dark Face of Google, Chapter 4 (First part) | 
NB this book and translation are published under Creative Commons license 
2.0 (Attribution, Non Commercial, Share Alike).
Commercial distribution requires the authorisation of the copyright 
holders:
Ippolita Collective and Feltrinelli Editore, Milano (.it)
Ippolita Collective
The Dark Side of Google (continued)
Chapter 4  Algorithms or Bust! (Part 1)
Google's mind-boggling rate of growth has not at all diminished its 
reputation as a fast, efficient, exhaustive, and accurate search engine: 
haven't we all heard the phrase "if it's not on Google, it doesn't 
exist!", together with "it's faster with Google!"? At the core of this 
success lies, besides elements we have discussed before, the PageRank[TM] 
algorithm /we mentioned in the introduction/ which steers Google's 
spider's forays through the Net.  Let's now look more closely at what it 
is, and how it works.
Algorithms and real life
An algorithm [*N1] is a method for solving a problem: a procedure 
built up of sequences of simple steps leading to a certain {desired} 
result. An algorithm that actually does solve a problem is said to be 
accurate, and if it does so speedily, it is also efficient. There are many 
different types of algorithms, and they are used in the most diverse 
scientific domains. Yet algorithms aren't some kind of arcane procedure 
concerning {and known} only {to} a handful of specialists; they are 
devices that profoundly influence our daily lives, much more so than would 
appear at first sight.
Take for instance the technique used to tape a television programme: it is 
based on algorithms, and so are the methods used to put a pile of papers in 
order, or to sequence the stop-overs of a long journey. Within a given 
time, by going through a number of simple, replicable steps, we make a more 
or less implicit choice of the algorithms that apply to the problem at 
hand. 'Simple' in this regard means foremost unequivocal, readily 
understandable for whoever will put the algorithm to work. Seen in this 
light, a kitchen recipe is an algorithm: "bring three liters of water to 
the boil in a pan, add salt, throw in one pound of rice, cook for twelve 
minutes and sieve, serve with a sauce to taste". All this is a simple, 
step-by-step description {of a cooking process}, provided the reader is 
able to interpret correctly elements such as "add salt", and "serve with a 
sauce to taste".
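The 'pile of papers' example can be made concrete. What follows is only a 
minimal sketch in Python of one well-known way to do it (insertion sort); 
the function name and the sample page numbers are made up for illustration 
and do not come from the book.

```python
def sort_pile(pile):
    """Order a pile of numbered papers using insertion sort.

    Each step is simple and unequivocal: take the next paper and slide it
    towards the front of the sorted part until it sits in the right place.
    """
    ordered = list(pile)  # work on a copy, leave the original pile alone
    for i in range(1, len(ordered)):
        paper = ordered[i]
        j = i - 1
        # Shift larger papers one slot back to make room.
        while j >= 0 and ordered[j] > paper:
            ordered[j + 1] = ordered[j]
            j -= 1
        ordered[j + 1] = paper
    return ordered


if __name__ == "__main__":
    print(sort_pile([7, 2, 9, 4, 1]))
```

Like the recipe, the procedure only works if whoever executes it reads 
each step in the same way: here 'larger' must mean the same thing for 
every pair of papers.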
Algorithms are not necessarily a method to obtain completely detailed 
results. Some are intended to arrive at acceptable results {within a given 
period of time} [French text: 'without concern for the time factor' - 
which doesn't sound very logical to me -TR]; others arrive at results 
through as few steps as possible; yet others focus on using as few 
resources as feasible [*N2].
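The difference between 'getting there' and 'getting there in as few steps 
as possible' can be illustrated by counting steps. The sketch below is an 
illustration only, with made-up function names: it compares a plain scan 
of a sorted list with a search that halves the range at every step.

```python
def linear_steps(sorted_items, target):
    """Count the comparisons made when scanning from the start."""
    steps = 0
    for item in sorted_items:
        steps += 1
        if item == target:
            break
    return steps


def binary_steps(sorted_items, target):
    """Count the probes made when repeatedly halving the search range."""
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps


if __name__ == "__main__":
    items = list(range(1000))
    print(linear_steps(items, 999), binary_steps(items, 999))
```

Both procedures are accurate; the second needs far fewer steps, at the 
price of requiring that the list be kept in order beforehand.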
It should also be stressed /before going deeper into the matter/ that 
nature itself is full of algorithms. Algorithms really concern us all 
because they constitute concrete practices meant to achieve a given 
objective. In the IT domain they are used to solve recurrent problems in 
software programming, in designing networks, and in building hardware. 
For a number of years now, due to the increasing importance of network-based 
reality analysis and interpretation models, many researchers have focused 
their studies on the construction methods and network trajectories of the 
data which are the 'viva materia' of algorithms.  The 'economy of search' 
John Battelle writes about [*N3] has become possible thanks to the {steady} 
improvement of the algorithms used for information retrieval, developed in 
order to augment the potential of data discovery and sharing, this with an 
{ever} increasing degree of efficiency, speed, accuracy, and security. The 
instance the general public is the most familiar with is the 
'peer-to-peer' {('P2P')} phenomenon: instead of setting up humongous 
databases for accessing videos, sound, texts, software, or any other kind 
of information {in digital format}, ever more optimised algorithms are 
being developed all the time, facilitating the creation of extremely 
decentralised networks, through which any user can make contact with any 
other user in order to engage in {mutually} beneficial exchanges.
The Strategy of objectivity
The tremendous increase of the quantity and quality of bandwidth, and of 
memory in our computers, together with rapidly diminishing costs, has 
enabled us to surf the Internet longer, better, and faster. Just twenty 
years ago, modems, with just a few hundred baud (number of 'symbols' 
transmitted per second) of connectivity, were the preserve of an elite. 
Today, optic fiber criss-crosses Europe, carrying millions of bytes per 
second, and is a technology accessible to all. Ten years ago, a fair 
amount of technical knowledge was required to create digital content. 
Today, the ease of publishing on the World Wide Web, the omnipresence 
of e-mail, the improvement of all kinds of online collective writing 
systems, such as blogs, wikis, portals, mailing lists, etc. together with 
the dwindling costs of registering Internet domains and addresses, have 
profoundly changed the nature of users: from simple users of information 
made available to them by IT specialists, they have increasingly become 
creators of information themselves.
The increase in the quality of connectivity goes together with an 
exponential increase in the quantity of data sent over the networks, 
which, as we have pointed out earlier, entails the introduction of 
steadily better performing search instruments. This pressing necessity 
exerts a deep attraction on social scientists, computer scientists, 
ergonomists, designers, specialists in communication, and a host of other 
experts. On the other hand, the 
'informational tsunami' that hits the global networks cannot be 
interpreted as mere 'networkisation' of societies as we know them, but 
must be seen as a complex phenomenon needing a completely fresh approach. 
We therefore believe that such a theoretical endeavour cannot be left to 
specialists alone, but demands a collective form of elaboration.
If indeed the production of DIY networks constitutes an opportunity to link 
autonomous realms together, we must also realise that the tools of social 
control embedded in IT technologies represent a formidable apparatus of 
repression.
The materialisation of this second scenario, most spectacularly 
exemplified by the Echelon eavesdropping system [*N5], looks 
{unfortunately} the most probable, given the steadily growing number of 
individuals who are giving information away, as opposed to an ever 
diminishing number of providers of search tools. Access to the 
information that is produced by this steadily growing number of 
individuals is managed with an iron hand by people who both retain 
a monopoly over it and, at the same time, reduce what is a tricky social 
issue to a mere marketing free-for-all contest where the best algorithm 
wins.
A search algorithm is a technical tool activating an extremely subtle 
marketing mechanism, as the user trusts that the search returns are not 
filtered and correspond to choices made by the 'community' of surfers. To 
sum up, what is triggered is a mechanism of trust in the objectivity of the 
technology itself, recognised as 'good' because it is free from human 
individuals' usual idiosyncratic influences and preferences. The 'good' 
machines, themselves issued from 'objective' science, and 'unbiased' 
research, will not tell lies since they cannot lie, and in any case don't 
have any interest in doing so. Reality, however, is very much at variance 
with this belief, which proves to be a demagogic presumption - the cover 
for fabulous profits from marketing and control.
Google's case is the most blatant example of this technology-based 
'strategy of objectivity'. Its 'good by definition' search engine keeps 
continuous track of what its users are doing in order to 'profile' their 
habits and exploits this information by inserting personally targeted and 
contextualised ads into all their activities (surfing, e-mailing, file 
handling, etc.). 'Lite' ads for sure, but all pervasive, and even able to 
generate feedback, so that users can, in the simplest way possible, 
provide information to vendors, and thus improve the 'commercial 
suggestions' themselves by expressing choices. This continuous soliciting 
of users, besides flattering them into thinking that they are participants 
in some vast 'electronic democracy', is in fact the simplest and most 
cost-effective way to obtain commercially valuable information about the 
tastes of consumers. The users' preferences and their ignorance {about the 
mechanism unleashed on them} are what constitute and reinforce the 
hegemony of a search engine, since a much visited site can alter its 
content as a consequence of the outcome of its 'commercial suggestions': a 
smart economic strategy indeed.
Seen from a purely computer science point of view, search engines perform 
four tasks: retrieving data from the Web (spider); stocking information in 
appropriate archives (databases); applying the correct algorithm to order 
data in accordance with the query, and finally, presenting results on an 
interface in a manner that satisfies the user. The first three tasks each 
require a particular type of algorithm: search & retrieval; memorisation 
& archiving; and query. Google's power, just as Yahoo!'s and that of other 
search giants /on the network/, is therefore based on:
1. A 'spider', that is a piece of software that captures content /on the 
net/.
2. An enormous capacity to stock data on secure carriers, and a lot of 
backup facilities, to avoid any accidental loss of data.
3. An extremely fast system able to retrieve and order the returns of a 
query, according to the ranking of the pages.
4. An interface at the user's side to present the returns of the queries 
requested (Google Desktop and Google Earth, however, are programmes the 
user must install on her/his machine beforehand).
Spiders, databases and searches
The spider is an application that is usually developed in the labs of the 
search engine companies. Its task is to surf web pages from one link to 
the next while collecting information, such as document format, keywords, 
page authors, next links, etc. When done with its /exploratory/ rounds, the 
spider software sends all this to the database for archiving /this 
information/. Additionally, the spider must monitor any changes on the 
sites visited so as to be able to programme its next visit and stock fresh 
data. The Google spider, for instance, manages two types of site-scans, 
one monthly and elaborate, the so-called 'deep crawl', the other daily, 
'fresh crawl', for updating purposes. This way, Google's databases are 
continuously updated /by the spider through its network surfing/. After 
every 'deep crawl', Google needed a few days to update the various 
indexes and to communicate the new results to all {its} data-centers. This 
lag time was known as the "Google Dance": the search returns used to be 
variable, since they stemmed from different indexes. But Google has 
altered its cataloguing and updating methods from 2003 onwards, and has 
also spread them much more in time, resulting in a much less pronounced 
'dance': now the search results vary in a dynamic and continuous fashion, 
and there are no longer periodic 'shake-ups'. In fact, the search returns 
will even change according to users' surfing behaviour, which is archived 
and used to 'improve', that is to 'simplify' the identification of {the} 
information {requested} [*N6].
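To make the spider's two jobs more tangible (following links, and 
noticing what has changed since the last visit), here is a minimal sketch 
in Python. It is not Google's crawler: the 'web' is a made-up dictionary 
held in memory, the URLs are invented, and comparing content hashes is 
merely one plausible way, assumed here, of detecting changes between two 
crawls.

```python
import hashlib
from collections import deque

# A made-up, in-memory 'web': URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("welcome page", ["b.example", "c.example"]),
    "b.example": ("about spiders", ["c.example"]),
    "c.example": ("about archives", ["a.example"]),
}


def crawl(web, start):
    """Follow links breadth-first from `start`; return URL -> content hash."""
    seen = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in web:
            continue  # already visited, or a dead link
        text, links = web[url]
        seen[url] = hashlib.sha256(text.encode("utf-8")).hexdigest()
        queue.extend(links)
    return seen


def changed_pages(old_snapshot, new_snapshot):
    """List URLs that are new or whose content hash differs between crawls."""
    return sorted(
        url for url, digest in new_snapshot.items()
        if old_snapshot.get(url) != digest
    )


if __name__ == "__main__":
    first = crawl(TOY_WEB, "a.example")
    later_web = dict(TOY_WEB)
    later_web["b.example"] = ("about spiders, updated", ["c.example"])
    print(changed_pages(first, crawl(later_web, "a.example")))
```

A real spider would of course fetch pages over the network, respect 
politeness rules, and schedule its visits; none of that is modelled here.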
The list of choices the application is working through in order to index a 
site is what constitutes the true force of the Google algorithm. And while 
the PageRank[TM] algorithm is patented by Stanford, and is therefore 
public, later alterations have not been /publicly/ revealed by Google, 
nor, {by the way}, by any other search engine company existing at the 
moment. And the back-up and recovery methods used in the data centers are 
not being made public either.
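Since the original PageRank[TM] formulation is public, its basic idea can 
be sketched. The Python below follows the simplified textbook version (a 
page's rank is shared out among the pages it links to, with a damping 
factor); it says nothing about the undisclosed later alterations mentioned 
above. The toy link graph, the function name, the damping value of 0.85 
and the number of iterations are choices made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly redistributing rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 3))
```

In this toy graph the page that receives links from both other pages ends 
up with the highest score, which is the intuition behind ranking by 
incoming links.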
Again, from a computer science point of view, a database is merely an 
archive in digital format: in its simplest, and till now also its most 
common form, it can be represented as one or more tables which are linked 
together and which have input and output values: these are called relational 
databases. A database, just like a classic archive, is organised according 
to precise rules regarding stocking, extraction and continuous enhancement 
of {the quality of} the data /themselves/ (think recovery of damaged data, 
redundancy avoidance, continuous updating of data acquisition procedures, 
etc). IT specialists have been studying for decades now the processes of 
introduction, quality improvement, and search and retrieval within 
databases. To this end, they have experimented with various approaches and 
computer languages (hierarchical, network and relational 
approaches, object oriented programming, etc.). The building up of a 
database is a crucial component of the development of a complex information 
system such as Google's, as its functionality is entirely dependent on it. 
In order to obtain a swift retrieval of data, and more generally, an 
efficient management of the same, it is essential to identify correctly 
what the exact purpose of the database is (and, in the case of relational 
databases, the purpose of the tables) which must be defined according to 
the domains and the relations that link them together. Naturally, it 
also becomes necessary to allow for approximations, something that is 
unavoidable when one switches from natural, analog languages to digital 
data. The rub lies in the secrecy of the methods: as is the case with 
all proprietary development projects, as opposed to those which are open 
and free, it is very difficult to find out which algorithms and programmes 
have been used.
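What 'tables which are linked together' means in practice can be shown 
with a small relational example. The sketch below uses Python's standard 
sqlite3 module; the table names, columns and sample rows are invented for 
illustration and bear no relation to how any search engine actually 
stores its data.

```python
import sqlite3


def build_archive():
    """Create an in-memory archive with two linked tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE pages (
            id    INTEGER PRIMARY KEY,
            url   TEXT UNIQUE NOT NULL,
            title TEXT NOT NULL
        );
        CREATE TABLE links (
            source_id INTEGER NOT NULL REFERENCES pages(id),
            target_id INTEGER NOT NULL REFERENCES pages(id),
            UNIQUE (source_id, target_id)   -- redundancy avoidance
        );
        """
    )
    db.executemany(
        "INSERT INTO pages (id, url, title) VALUES (?, ?, ?)",
        [(1, "a.example", "Home"), (2, "b.example", "Spiders"),
         (3, "c.example", "Archives")],
    )
    db.executemany(
        "INSERT INTO links (source_id, target_id) VALUES (?, ?)",
        [(1, 2), (1, 3), (2, 3)],
    )
    return db


def inbound_link_counts(db):
    """Join the two tables: how many pages link to each URL?"""
    rows = db.execute(
        """
        SELECT p.url, COUNT(l.source_id)
        FROM pages AS p
        LEFT JOIN links AS l ON l.target_id = p.id
        GROUP BY p.id, p.url
        ORDER BY p.url
        """
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(inbound_link_counts(build_archive()))
```

The purpose of each table is fixed in advance (one describes pages, the 
other the relations between them), which is exactly the design decision 
the paragraph above calls essential.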
Documents from research centers and universities allow a few glimpses of 
information on proprietary projects, as far as it has been made public. 
They contain some useful tidbits for understanding the structure of the 
computers {used} and the way search engines are managing data. Just to 
give an idea of the computing power available today, one finds 
descriptions of computers which are able to resolve Internet addresses in 
0.5 microseconds into the unique bit sequences that serve to index in 
databases, while executing 9000 spider {'crawls'} at the same time. These 
systems are able to memorize and analyze 50 million web pages a day [*N7].
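A little arithmetic helps to put these figures in perspective. The 
following lines only rework the numbers quoted above; the even spread of 
pages over the spiders is an assumption made purely for the sake of the 
calculation.

```python
SECONDS_PER_DAY = 24 * 60 * 60

pages_per_day = 50_000_000   # figure quoted in the text
resolve_time_s = 0.5e-6      # 0.5 microseconds per address, as quoted
concurrent_spiders = 9000    # figure quoted in the text

pages_per_second = pages_per_day / SECONDS_PER_DAY
resolutions_per_second = 1 / resolve_time_s
# Assumption for illustration only: the load is spread evenly over spiders.
pages_per_spider_per_day = pages_per_day / concurrent_spiders

print(f"{pages_per_second:.0f} pages analysed per second")
print(f"{resolutions_per_second:.0f} address resolutions per second")
print(f"{pages_per_spider_per_day:.0f} pages per spider per day")
```

In other words, 50 million pages a day comes to several hundred pages 
every second, around the clock.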
The last algorithmic element hiding behind Google's 'simple' facade is the 
search system, which, starting from a query by the user, is able to find, 
order, rank and finally return the most pertinent results to the 
interface.
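The chain 'find, order, rank, return' can be sketched with an inverted 
index, a standard information-retrieval structure that maps each word to 
the documents containing it. This is a toy illustration, not a description 
of Google's search system: the three documents are invented, and the score 
(a plain count of matching words) is a deliberately naive stand-in for a 
real ranking function.

```python
from collections import Counter, defaultdict

# Hypothetical mini-archive: document id -> text.
DOCS = {
    "doc1": "spiders crawl the web and collect pages",
    "doc2": "databases archive pages for later retrieval",
    "doc3": "the web is a web of linked pages",
}


def build_index(docs):
    """Map every word to the documents containing it, with word counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index


def search(index, query):
    """Return ids of documents containing all query words, best first.

    The score is simply the summed word count; ties are broken by id.
    """
    words = query.lower().split()
    if not words:
        return []
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    scores = {
        doc_id: sum(index[word][doc_id] for word in words)
        for doc_id in candidates
    }
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    toy_index = build_index(DOCS)
    print(search(toy_index, "web pages"))
```

Whatever scoring rule replaces the naive count decides which results come 
first, and that is precisely the part that is not open to inspection.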
A number of labs and universities have by now decided to make public their 
research in this domain, especially regarding answers {to problems} that 
have been found, and the various methods used to optimise access speed to 
the data, {questions about} the complexity of the systems, and the most 
interesting instances of parameter selection.
Search engines must indeed be able to provide almost instantaneously the 
best possible results while at the same time offering the widest range of 
choice. Google would without doubt appear as the most advanced search 
engine of the moment: as we will see /in detail/ in the next chapter, 
these extraordinary results cannot but be the outcome of a very 
'propitious' {form of} filtering...
For the time being, suffice it to say that the best solution resides in a 
proper balance between computing power and the quality of the 
search algorithm. You need truly extraordinary archival supports and 
indexation systems to find the information you are looking for when the 
mass of data is measured in terabytes (1 TB = 1000 gigabytes = 1000 raised 
to the power of 4 bytes), or even in petabytes (1 PB = 1000 TB [or 1024 TB , 
Wikipedia's funny...-TR]), and also a remarkable ability to both determine 
where the information is in the gigantic archive and to calculate the 
{fastest} time needed to retrieve it.
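The '1000 or 1024' question in the translator's aside comes down to two 
sets of prefixes: decimal ones (powers of 1000) and binary ones (powers of 
1024). A few lines of Python, with variable names chosen here for 
illustration, show the sizes involved and how far the two conventions 
drift apart.

```python
# Decimal (SI) prefixes: powers of 1000.
KB, MB, GB, TB, PB = (1000 ** n for n in range(1, 6))
# Binary prefixes: powers of 1024.
KIB, MIB, GIB, TIB, PIB = (1024 ** n for n in range(1, 6))

print(f"1 TB  = {TB:,} bytes")
print(f"1 PB  = {PB:,} bytes")
print(f"1 PiB = {PIB:,} bytes")
print(f"binary/decimal ratio at the peta scale: {PIB / PB:.3f}")
```

At the scale of petabytes the two definitions differ by more than ten per 
cent, which is why it matters which one a given figure is using.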
And as far as Google's computing capacities are concerned, the Web is full 
of - not always verifiable nor credible - {myths and} legends, especially 
since the firm is not particularly talkative about its technological 
infrastructure. Certain sources are buzzing about lakhs [See Chapter 1 
;-)] of computers interconnected through thousands of gigantic 'clusters' 
[sitting on appropriate GNU/Linux distros - French text unclear]; others 
talk about mega-computers, whose design comes straight out of SciFi 
scenarios: humongous freeze-cooled silos where a forest of mechanical 
arms moves thousands of hard disks at lightning speed. Both speculations 
are just as plausible {or fanciful}, and do not necessarily exclude each 
other. In any case, it is obvious that the extraordinary flexibility of 
Google's machines allows for exceptional performances, as long as the 
system remains 'open' - to continuous {in-house} improvements, that is.
(to be continued)  
--------------------------
Translated by Patrice Riemens
This translation project is supported and facilitated by:
The Center for Internet and Society, Bangalore
(http://cis-india.org)
The Tactical Technology Collective, Bangalore Office
(http://www.tacticaltech.org)
Visthar, Dodda Gubbi post, Kothanyur-Bangalore
(http://www.visthar.org)
#  distributed via <nettime>: no commercial use without permission